{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Notebook 4.1: ChatGLM2-6B\n",
    "\n",
    "## 4.1.1 Overview\n",
    "This example shows how to run [ChatGLM2-6B](https://github.com/THUDM/ChatGLM2-6B) Chinese inference on low-cost PCs (without the need of discrete GPU) using [BigDL-LLM](https://github.com/intel-analytics/BigDL/tree/main/python/llm) APIs. ChatGLM2-6B is the second-generation version of the open-source bilingual (Chinese-English) chat model [ChatGLM-6B](https://github.com/THUDM/ChatGLM-6B) proposed by [THUDM](https://github.com/THUDM). ChatGLM2-6B also can be found in [Huggingface models](https://huggingface.co/models) in following [link](https://huggingface.co/THUDM/chatglm2-6b).\n",
    "\n",
    "Before conducting inference, you may need to prepare environment according to [Chapter 2](../ch_2_Environment_Setup/README.md)."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1.2 Installation\n",
    "\n",
    "First of all, install BigDL-LLM in your prepared environment. For best practices of environment setup, refer to [Chapter 2](../ch_2_Environment_Setup/README.md) in this tutorial."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --pre --upgrade bigdl-llm[all]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The all option is for installing other required packages by BigDL-LLM."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1.3 Load Model and Tokenizer\n",
    "\n",
    "### 4.1.3.1 Load Model\n",
    "\n",
    "Load ChatGLM2 model with low-bit optimization(INT4) for lower resource cost using BigDL-LLM APIs, which convert the relevant layers in the model into INT4 format.\n",
    "\n",
    "> **Note**\n",
    ">\n",
    "> BigDL-LLM has supported `AutoModel`, `AutoModelForCausalLM`, `AutoModelForSpeechSeq2Seq` and `AutoModelForSeq2SeqLM`. The AutoClasses help users automatically retrieve the relevant model, in this case, we can simply use `AutoModel` to load.\n",
    "\n",
    "> **Note**\n",
    ">\n",
    "> You can specify the argument `model_path` with both Huggingface repo id or local model path."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from bigdl.llm.transformers import AutoModel\n",
    "\n",
    "model_path = \"THUDM/chatglm2-6b\"\n",
    "model = AutoModel.from_pretrained(model_path,\n",
    "                                  load_in_4bit=True,\n",
    "                                  trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.3.2 Load Tokenizer\n",
    "\n",
    "A tokenizer is also needed for LLM inference. It is used to encode input texts to tensors to feed to LLMs, and decode the LLM output tensors to texts. You can use [Huggingface transformers](https://huggingface.co/docs/transformers/index) API to load the tokenizer directly. It can be used seamlessly with models loaded by BigDL-LLM."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "metadata": {},
   "outputs": [],
   "source": [
    "from transformers import AutoTokenizer\n",
    "\n",
    "tokenizer = AutoTokenizer.from_pretrained(model_path,\n",
    "                                          trust_remote_code=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1.4 Inference\n",
    "\n",
    "### 4.1.4.1 Create Prompt Template\n",
    "\n",
    "Before generating, you need to create a prompt template. Here we give an example prompt template for question and answering refers to [ChatGLM2-6B prompt template](https://huggingface.co/THUDM/chatglm2-6b/blob/main/modeling_chatglm.py#L1007). You can tune the prompt based on your own model as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "metadata": {},
   "outputs": [],
   "source": [
    "CHATGLM_V2_PROMPT_TEMPLATE = \"问：{prompt}\\n\\n答：\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.4.2 Generate\n",
    "\n",
    "Then, you can generate output with loaded model and tokenizer.\n",
    "\n",
    "> **Note**\n",
    "> \n",
    "> `max_new_tokens` parameter in the `generate` function defines the maximum number of tokens to predict. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 25,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------- Output --------------------\n",
      "问:AI是什么?\n",
      "\n",
      "答: AI是人工智能(Artificial Intelligence)的缩写,指的是一种能够模拟人类智能的技术或系统。AI系统可以通过学习、推理、解决问题等方式,实现类似于\n"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "prompt = \"AI是什么？\"\n",
    "n_predict = 32\n",
    "\n",
    "with torch.inference_mode():\n",
    "    prompt = CHATGLM_V2_PROMPT_TEMPLATE.format(prompt=prompt)\n",
    "    input_ids = tokenizer.encode(prompt, return_tensors=\"pt\")\n",
    "    output = model.generate(input_ids,\n",
    "                            max_new_tokens=n_predict)\n",
    "    output_str = tokenizer.decode(output[0], skip_special_tokens=True)\n",
    "    print('-'*20, 'Output', '-'*20)\n",
    "    print(output_str)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.4.3 Stream Chat\n",
    "\n",
    "ChatGLM2-6B support streaming output function `stream_chat`, which enable the model to provide a streaming response word by word. However, other models may not provide similar APIs, if you want to implement general streaming output function, please refer to [Chapter 5.1](../ch_5_AppDev_Intermediate/5_1_ChatBot.ipynb).\n",
    "\n",
    "> **Note**\n",
    ">\n",
    "> To successfully observe the text streaming behavior in standard output, we need to set the environment variable `PYTHONUNBUFFERED=1 `to ensure that the standard output streams are directly sent to the terminal without being buffered first."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "-------------------- Stream Chat Output --------------------\n",
      "AI指的是人工智能,是一种能够通过学习和理解数据,以及应用适当的算法和数学模型,来执行与人类智能相似的任务的计算机程序。AI可以包括机器学习、自然语言处理、计算机视觉、专家系统、强化学习等不同类型的技术。\n",
      "\n",
      "AI的应用领域广泛,例如自然语言处理可用于语音识别、机器翻译、情感分析等;计算机视觉可用于人脸识别、图像识别、自动驾驶等;机器学习可用于预测、分类、聚类等数据分析任务。\n",
      "\n",
      "AI是一种非常有前途的技术,已经在许多领域产生了积极的影响,并随着技术的不断进步,将继续为我们的生活和工作带来更多的便利和改变。"
     ]
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "with torch.inference_mode():\n",
    "    question = \"AI 是什么？\"\n",
    "    response_ = \"\"\n",
    "    print('-'*20, 'Stream Chat Output', '-'*20)\n",
    "    for response, history in model.stream_chat(tokenizer, question, history=[]):\n",
    "        print(response.replace(response_, \"\"), end=\"\")\n",
    "        response_ = response"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.1.5 Use in LangChain\n",
    "\n",
    "[LangChain](https://python.langchain.com/docs/get_started/introduction.html) is a widely used framework for developing applications powered by language models. In this section, we will show how to integrate BigDL-LLM with LangChain. You can follow this [instruction](https://python.langchain.com/docs/get_started/installation) to prepare environment for LangChain."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "If you need, install LangChain as following, or you can refer to [Chapter 8](../ch_8_AppDev_Advanced/README.md) for more information about LangChain integrations:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install -U langchain==0.0.248"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "> **Note**\n",
    "> \n",
    "> We recommend to use `langchain==0.0.248`, which is verified in our tutorial."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.5.1 Create Prompt Template\n",
    "\n",
    "Before inference, you need to create a prompt template. Here we give an example prompt template for question and answering, which contains two input variables, `history` and `human_input`. You can tune the prompt based on your own model as well."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [],
   "source": [
    "CHATGLM_V2_LANGCHAIN_PROMPT_TEMPLATE = \"\"\"{history}\\n\\n问：{human_input}\\n\\n答：\"\"\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.5.2 Prepare Chain"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use [LangChain API](https://api.python.langchain.com/en/latest/api_reference.html) `LLMChain` to construct a chain for inference. Here we use BigDL-LLM APIs to construct a `LLM` object, which will load model with low-bit optimization automatically.\n",
    "\n",
    "> **Note**\n",
    ">\n",
    "> `ConversationBufferWindowMemory` is a type of memory in LangChain that keeps a sliding window of the most recent `k` interactions in a conversation."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from langchain import LLMChain, PromptTemplate\n",
    "from bigdl.llm.langchain.llms import TransformersLLM\n",
    "from langchain.memory import ConversationBufferWindowMemory\n",
    "\n",
    "llm_model_path = \"THUDM/chatglm2-6b\" # the path to the huggingface llm model\n",
    "\n",
    "prompt = PromptTemplate(input_variables=[\"history\", \"human_input\"], template=CHATGLM_V2_LANGCHAIN_PROMPT_TEMPLATE)\n",
    "max_new_tokens = 128\n",
    "\n",
    "llm = TransformersLLM.from_model_id(\n",
    "        model_id=llm_model_path,\n",
    "        model_kwargs={\"trust_remote_code\": True},\n",
    ")\n",
    "\n",
    "# Following code are complete the same as the use-case\n",
    "llm_chain = LLMChain(\n",
    "    llm=llm,\n",
    "    prompt=prompt,\n",
    "    verbose=True,\n",
    "    llm_kwargs={\"max_new_tokens\":max_new_tokens},\n",
    "    memory=ConversationBufferWindowMemory(k=2),\n",
    ")\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1.5.3 Generate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "\n",
      "\n",
      "\u001b[1m> Entering new LLMChain chain...\u001b[0m\n",
      "Prompt after formatting:\n",
      "\u001b[32;1m\u001b[1;3m\n",
      "\n",
      "问：AI 是什么？\n",
      "\n",
      "答：\u001b[0m\n",
      "AI指的是人工智能,是一种能够通过学习和理解数据,以及应用适当的算法和数学模型,来执行与人类智能相似的任务的技术。AI可以包括机器学习、自然语言处理、计算机视觉、知识表示、推理、决策等多种技术。\n",
      "\n",
      "\n",
      "\n",
      "\u001b[1m> Finished chain.\u001b[0m\n"
     ]
    }
   ],
   "source": [
    "text = \"AI 是什么？\"\n",
    "response_text = llm_chain.run(human_input=text,stop=\"\\n\\n\")\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "cn-eval",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.18"
  },
  "orig_nbformat": 4
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
